Clustering - Sleep Recommendations

This notebook clusters the Fitbit vitals data loaded from the corresponding pickle files, then uses sleep efficiency labels to measure cluster impurity, inspect the cluster distributions, and derive good sleep recipes.

Importing Required Libraries

In [1]:
# Importing scientific libraries required for analysis and handling data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Importing libraries related to handling of files and directory
import os
import glob
import pickle
import random

# Importing utility functions from the code base
from utils.directory_utils import *
from utils.general_utils import *
from utils.sleep_utils import *
from data_preprocessor.get_user_data import *
from clustering_utils import *
from kmeans_dm import *

# Importing Machine Learning utilities
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from tslearn.clustering import TimeSeriesKMeans
from sklearn.decomposition import PCA
from statsmodels.tsa.seasonal import seasonal_decompose
from scipy.stats import boxcox
from scipy.spatial import distance
from tslearn.metrics import dtw, cdist_dtw
from sklearn.metrics import silhouette_score
from scipy.stats import entropy

Data

This section loads each type of data from the saved pickle files and collects the relevant parts into NumPy arrays for further analysis.

Heart Rate, Sleep, Calories and Activity Time Series Data

User Data Loader

In [27]:
# First we load the data for each user separately from their own numpy arrays and then stack them to get the final arrays
numpy_array_directory = f'../data/data_numpy_arrays/'

heart_rate_ts_data = []
calories_ts_data = []
activity_label_ts_data = []
activity_percentages = []
sleep_effeciency_ratio = []
sleep_stages_summary = []

for user_name in get_subdirectory_nms(numpy_array_directory):
#     if user_name in  ['Meghna\\']:
#         continue
    user_directory = construct_path(numpy_array_directory, user_name)

    user_heart_rate_ts_data = np.load(construct_path(user_directory, f'heart_rate_ts_data.npy'))
    user_calories_ts_data = np.load(construct_path(user_directory, f'calories_ts_data.npy'))
    user_activity_label_ts_data = np.load(construct_path(user_directory, f'activity_label_ts_data.npy'))
    user_activity_percentages = np.load(construct_path(user_directory, f'activity_percentages.npy'))
    user_sleep_effeciency_ratio = np.load(construct_path(user_directory, f'sleep_efficiency_ratio.npy'))
    user_sleep_stages_summary = pd.read_csv(construct_path(user_directory, f'sleep_stages_summary.csv'))

    heart_rate_ts_data.append(user_heart_rate_ts_data)
    calories_ts_data.append(user_calories_ts_data)
    activity_label_ts_data.append(user_activity_label_ts_data)
    activity_percentages.append(user_activity_percentages)
    sleep_effeciency_ratio.append(user_sleep_effeciency_ratio)
    sleep_stages_summary.append(user_sleep_stages_summary)

heart_rate_ts_data = np.vstack(heart_rate_ts_data)[:, :]
calories_ts_data = np.vstack(calories_ts_data)[:, :]
activity_label_ts_data = np.vstack(activity_label_ts_data)[:, :]
activity_percentages = np.vstack(activity_percentages)
sleep_effeciency_ratio = np.hstack(sleep_effeciency_ratio)
sleep_stages_summary = pd.concat(sleep_stages_summary)
In [26]:
# Convert activity percentages to minutes per day (1440 minutes in a day)
activity_percentages = activity_percentages * 1440 / 100

Check for the shape of all the arrays and dataframes

In [28]:
# Check for the shape of all the arrays and dataframes
heart_rate_ts_data.shape, calories_ts_data.shape, activity_label_ts_data.shape, sleep_effeciency_ratio.shape, sleep_stages_summary.shape
Out[28]:
((272, 1440), (272, 1440), (272, 1440), (272,), (272, 4))
In [29]:
# Make sure activity value does not have a nan field (not sure how we would fill this)
print(np.isnan(activity_label_ts_data).any())
# Check that no nans in any of the data
np.isnan(heart_rate_ts_data).any(), np.isnan(calories_ts_data).any()
False
Out[29]:
(False, False)

Transformations

This section transforms the original time series by extracting the trend component of each day's heart rate and calorie series with an additive seasonal decomposition.

In [30]:
trend_window_length = 10
In [31]:
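heart_trends = []
for day in heart_rate_ts_data:
    # Extract the trend with an additive decomposition over a window of trend_window_length minutes.
    # Note: newer statsmodels releases name the `freq` argument `period`.
    result = seasonal_decompose(day, model='additive', freq=trend_window_length, extrapolate_trend='freq')
    heart_trends.append(result.trend)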
heart_trends = np.array(heart_trends)
heart_trends = remove_nans_from_array(heart_trends)
# Make sure the shape is same and there are no nan values
heart_trends.shape, np.isnan(heart_trends).any()
Out[31]:
((272, 1440), False)
In [32]:
# Plotting heart trends to assess the fit to the overall data
plt.plot(heart_rate_ts_data[0, :])
plt.plot(heart_trends[0, :])
Out[32]:
[<matplotlib.lines.Line2D at 0x23f8731a5f8>]
In [33]:
calories_trends = []
for day in calories_ts_data:
    result = seasonal_decompose(day, model='additive', freq=trend_window_length, extrapolate_trend='freq')
    calories_trends.append(result.trend)
calories_trends = np.array(calories_trends)
calories_trends = remove_nans_from_array(calories_trends)
# Make sure the shape is same and there are no nan values
calories_trends.shape, np.isnan(calories_trends).any()
Out[33]:
((272, 1440), False)
In [34]:
# Plotting calories trends to assess the fit to the overall data
plt.plot(calories_ts_data[0, :])
plt.plot(calories_trends[0, :])
Out[34]:
[<matplotlib.lines.Line2D at 0x23f8582b630>]
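
For intuition, the trend returned by `seasonal_decompose` behaves much like a centred moving average of the series. The sketch below is a simplified stand-in rather than the statsmodels implementation: it assumes a plain odd-length window with edge padding, and the helper name `moving_average_trend` is hypothetical.

```python
import numpy as np

def moving_average_trend(series, window=11):
    """Centred moving average with edge padding (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    if window % 2 == 0:
        raise ValueError("window must be odd so the average is centred")
    half = window // 2
    # Repeat the first and last values so the output keeps the input length
    padded = np.pad(series, half, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

# A noisy ramp standing in for one day of minute-level heart rate
rng = np.random.default_rng(0)
noisy = np.linspace(60, 90, 1440) + rng.normal(0, 2, 1440)
trend = moving_average_trend(noisy, window=11)
print(trend.shape)
```

A wider window gives a smoother trend at the cost of blurring short bursts of activity.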

Chipping the Data

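This section trims the heart rate and calorie trends to minutes 360 through 1079 of each day (06:00 to 18:00, if index 0 corresponds to midnight).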

In [35]:
heart_trends = heart_trends[:, 360:1080]
calories_trends = calories_trends[:, 360:1080]
heart_trends.shape, calories_trends.shape
Out[35]:
((272, 720), (272, 720))

Dimensionality Reduction

This section reduces each 720-minute series to 48 values by aggregating over 15-minute windows (`mean_window_length`), so that the clustering techniques run on a lower-dimensional input.

In [36]:
mean_window_length = 15
In [37]:
# Reduce the dimension of the arrays
reduced_heart_trends = reduce_time_series_dimension(heart_trends, mean_window_length, hours=12)
reduced_calories_trends = reduce_time_series_dimension(calories_trends, mean_window_length, hours=12)
# Check for the shape of the arrays
reduced_heart_trends.shape, reduced_calories_trends.shape
Out[37]:
((272, 48), (272, 48))
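
The internals of `reduce_time_series_dimension` are not shown in this notebook. A minimal sketch of one plausible reading, assuming it averages consecutive non-overlapping windows, is given below; `window_mean_reduce` is a hypothetical name. It reproduces the shape change seen above: 720 minutes with a window of 15 give 48 columns.

```python
import numpy as np

def window_mean_reduce(ts, window):
    """Average consecutive non-overlapping windows along the time axis."""
    ts = np.asarray(ts, dtype=float)
    n_days, n_minutes = ts.shape
    if n_minutes % window != 0:
        raise ValueError("series length must be a multiple of the window")
    # Split each day into (n_minutes // window) blocks and average each block
    return ts.reshape(n_days, n_minutes // window, window).mean(axis=2)

demo = np.arange(2 * 720, dtype=float).reshape(2, 720)
reduced = window_mean_reduce(demo, 15)
print(reduced.shape)  # (2, 48)
```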

Sleep Labels

In this section of the notebook we try to find the optimal boundary for constructing the sleep labels using different techniques

In [38]:
# Constructing a histogram plot for the sleep efficiency ratio.
# Sleep Efficiency Ratio is found as total_time_asleep / total_time_in_bed
sns.distplot(sleep_effeciency_ratio)
plt.xlabel('Sleep Efficiency')
plt.ylabel('Frequency')
plt.title('Sleep Efficiency Histogram')
Out[38]:
Text(0.5, 1.0, 'Sleep Efficiency Histogram')
In [39]:
# Constructing a histogram plot for the different sleep stages.
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(sleep_stages_summary['wake'], ax = ax[0, 0])
ax[0, 0].set_xlabel('Minutes Awake')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('Minutes Awake Histogram')

sns.distplot(sleep_stages_summary['light'], ax = ax[0, 1])
ax[0, 1].set_xlabel('Minutes in Light Sleep')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('Minutes in Light Sleep Histogram')

sns.distplot(sleep_stages_summary['rem'], ax = ax[1, 0])
ax[1, 0].set_xlabel('Minutes in REM Sleep')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('Minutes in REM Sleep Histogram')

sns.distplot(sleep_stages_summary['deep'], ax = ax[1, 1])
ax[1, 1].set_xlabel('Minutes in Deep Sleep')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('Minutes in Deep Sleep Histogram')
Out[39]:
Text(0.5, 1.0, 'Minutes in Deep Sleep Histogram')

Gap Definition For Sleep Efficiency

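Create a gap of a given width (a tunable parameter) around the efficiency threshold, so that borderline nights are not labelled either way.

Example: with a gap of 0.05, nights at 0.875 and above count as good sleep and nights at 0.825 and below count as poor sleep.

Note that the cell below applies a single threshold of 0.89 with no gap.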

In [40]:
final_sleep_labels = sleep_effeciency_ratio > 0.89
sns.distplot(np.array(final_sleep_labels, dtype=int), kde=False)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x23f87a7f240>
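
The gap idea described above can be sketched as follows. This is an illustration only, not code from the project: the function name `gap_sleep_labels`, the `center` of 0.85, and the choice to return a mask for nights outside the gap are all assumptions.

```python
import numpy as np

def gap_sleep_labels(efficiency, center=0.85, gap=0.05):
    """Return (labels, keep_mask): labels are True for good sleep, and
    keep_mask is False for nights that fall inside the gap."""
    efficiency = np.asarray(efficiency, dtype=float)
    upper = center + gap / 2
    lower = center - gap / 2
    # Nights strictly between lower and upper are treated as ambiguous
    keep_mask = (efficiency >= upper) | (efficiency <= lower)
    labels = efficiency >= upper
    return labels, keep_mask

ratios = np.array([0.80, 0.84, 0.86, 0.90, 0.95])
labels, keep = gap_sleep_labels(ratios)
print(labels[keep])  # labels for the nights outside the gap
```

If a gap were adopted, the same `keep` mask would also have to be applied to the heart rate, calorie, and activity arrays so that rows stay aligned.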

HeatMap for Euclidean and DTW Distances

In [41]:
good_sleep_heart_trends = reduced_heart_trends[final_sleep_labels]
poor_sleep_heart_trends = reduced_heart_trends[~final_sleep_labels]
print(good_sleep_heart_trends.shape, poor_sleep_heart_trends.shape)
ordered_heart_trends = np.vstack((good_sleep_heart_trends, poor_sleep_heart_trends))
print(ordered_heart_trends.shape)
(137, 48) (135, 48)
(272, 48)
In [42]:
good_sleep_calories_trends = reduced_calories_trends[final_sleep_labels]
poor_sleep_calories_trends = reduced_calories_trends[~final_sleep_labels]
print(good_sleep_calories_trends.shape, poor_sleep_calories_trends.shape)
ordered_calories_trends = np.vstack((good_sleep_calories_trends, poor_sleep_calories_trends))
print(ordered_calories_trends.shape)
(137, 48) (135, 48)
(272, 48)
In [20]:
%%time
dtw_dist_heart = cdist_dtw(ordered_heart_trends)
dtw_dist_calories = cdist_dtw(ordered_calories_trends)
euc_dist_heart = distance.cdist(ordered_heart_trends, ordered_heart_trends)
euc_dist_calories = distance.cdist(ordered_calories_trends, ordered_calories_trends)
Wall time: 4min 48s
In [21]:
m_dist_heart = distance.cdist(ordered_heart_trends, ordered_heart_trends, 'mahalanobis')
m_dist_calories = distance.cdist(ordered_calories_trends, ordered_calories_trends, 'mahalanobis')
l1_dist_heart = distance.cdist(ordered_heart_trends, ordered_heart_trends, 'minkowski', p=1)
l1_dist_calories = distance.cdist(ordered_calories_trends, ordered_calories_trends, 'minkowski', p=1)
In [22]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.heatmap(dtw_dist_heart, xticklabels=10, yticklabels=10, ax=ax[0])
ax[0].set_title('All Sleep DTW Distance Cross Matrix for Heart Trends')
sns.heatmap(dtw_dist_calories, xticklabels=10, yticklabels=10, ax=ax[1])
ax[1].set_title('All Sleep DTW Distance Cross Matrix for Calories Trends')
Out[22]:
Text(0.5, 1.0, 'All Sleep DTW Distance Cross Matrix for Calories Trends')
In [23]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.heatmap(euc_dist_heart, xticklabels=10, yticklabels=10, ax=ax[0])
ax[0].set_title('All Sleep Euclidean Distance Cross Matrix for Heart Trends')
sns.heatmap(euc_dist_calories, xticklabels=10, yticklabels=10, ax=ax[1])
ax[1].set_title('All Sleep Euclidean Distance Cross Matrix for Calories Trends')
Out[23]:
Text(0.5, 1.0, 'All Sleep Euclidean Distance Cross Matrix for Calories Trends')
In [22]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.heatmap(m_dist_heart, xticklabels=10, yticklabels=10, ax=ax[0])
ax[0].set_title('All Sleep Mahalanobis Distance Cross Matrix for Heart Trends')
sns.heatmap(m_dist_calories, xticklabels=10, yticklabels=10, ax=ax[1])
ax[1].set_title('All Sleep Mahalanobis Distance Cross Matrix for Calories Trends')
Out[22]:
Text(0.5, 1.0, 'All Sleep Mahalanobis Distance Cross Matrix for Calories Trends')
In [23]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.heatmap(l1_dist_heart, xticklabels=10, yticklabels=10, ax=ax[0])
ax[0].set_title('All Sleep L1 Norm Distance Cross Matrix for Heart Trends')
sns.heatmap(l1_dist_calories, xticklabels=10, yticklabels=10, ax=ax[1])
ax[1].set_title('All Sleep L1 Norm Distance Cross Matrix for Calories Trends')
Out[23]:
Text(0.5, 1.0, 'All Sleep L1 Norm Distance Cross Matrix for Calories Trends')
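
The heatmaps above compare several distance measures. Euclidean distance compares two series position by position, whereas DTW allows one series to stretch or compress in time before comparing. The dynamic-programming sketch below assumes a squared point-wise cost with a square root at the end; `simple_dtw` is a hypothetical name, and this is not tslearn's optimised implementation.

```python
import numpy as np

def simple_dtw(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] is the cheapest alignment of a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 2, 1, 0]  # same shape as x, delayed by one step
print(simple_dtw(x, y))
```

Because the delayed copy can be warped back onto the original, its DTW distance is zero even though the two sequences differ in length. The double loop is quadratic in the series length, which is one reason the DTW cross matrices take minutes to compute.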

Activity Percentages

In this section of the notebook we aggregate the activity labels of a person from minute level to percentage level
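
The aggregation itself happens upstream of this notebook. As a rough sketch of the idea, assuming the minute-level labels are integers 0 to 3 for sedentary, light, moderate, and vigorous (the integer coding and the name `activity_level_percentages` are assumptions):

```python
import numpy as np

def activity_level_percentages(labels, n_levels=4):
    """Share of each day's minutes spent at each activity level, in percent."""
    labels = np.asarray(labels)
    # For each level, count matching minutes per day, one column per level
    counts = np.stack(
        [(labels == level).sum(axis=1) for level in range(n_levels)], axis=1
    )
    return counts / labels.shape[1] * 100

day = np.zeros((1, 1440), dtype=int)  # start fully sedentary
day[0, :360] = 1                      # 360 minutes of light activity
day[0, 360:432] = 2                   # 72 minutes of moderate activity
percentages = activity_level_percentages(day)
print(percentages)
```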

In [43]:
# Constructing a histogram plot for the different activity level percentages.
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[:, 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[:, 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[:, 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[:, 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[43]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [44]:
# Constructing a histogram plot for the different activity level percentages visualizing with respect to the good sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[~final_sleep_labels, 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[final_sleep_labels, 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[~final_sleep_labels, 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[final_sleep_labels, 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[~final_sleep_labels, 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[final_sleep_labels, 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[~final_sleep_labels, 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[final_sleep_labels, 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[44]:
<matplotlib.legend.Legend at 0x23f88a7e828>

Clustering

In this section of the notebook we apply different clustering techniques to the prepared data and examine which good sleep recipes emerge.

In [46]:
num_master_clusters = 4
num_activity_clusters = 4

K-Means - Euclidean

Here we apply K-Means to the data with Euclidean distance (L2 norm) as the distance metric.

Getting the Best Model

In [47]:
kmeans_mod = get_best_clustering_model(lambda num_clusters: KMeans(num_clusters), reduced_heart_trends)

Fitting the Model

In [48]:
# Set the seed so that we get the same clustering every time
# random.seed(2)
# np.random.seed(1000)
# Performing the Clustering
# kmeans_mod = KMeans(n_clusters=num_master_clusters)
kmeans_mod.fit(reduced_heart_trends)
cluster_assignments = kmeans_mod.predict(reduced_heart_trends)
sil_score = silhouette_score(reduced_heart_trends, cluster_assignments)
print(kmeans_mod.n_clusters, sil_score)
np.unique(cluster_assignments, return_counts=True)
2 0.18620674371673168
Out[48]:
(array([0, 1]), array([176,  96], dtype=int64))
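
The silhouette score printed above summarises how well separated the clusters are: values near 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping ones. For reference, the score can be computed by hand as in the sketch below. It is a plain NumPy illustration that assumes Euclidean distance and at least two clusters; `simple_silhouette` is a hypothetical name, and in practice `sklearn.metrics.silhouette_score` should be preferred.

```python
import numpy as np

def simple_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = dists[i, same].mean()  # mean distance within the own cluster
        b = min(dists[i, labels == other].mean()  # nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

pts = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(simple_silhouette(pts, [0, 0, 1, 1]))  # well separated: close to 1
print(simple_silhouette(pts, [0, 1, 0, 1]))  # mixed up: negative
```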
In [49]:
# Update the number of activity clusters based on the minimum amount of records assigned to a cluster
num_activity_clusters = min(num_activity_clusters, *(np.unique(cluster_assignments, return_counts=True)[1]))
print('Updated Number of activity clusters:', num_activity_clusters)
Updated Number of activity clusters: 4
In [50]:
# Visualizing the number of points in each cluster
sns.distplot(cluster_assignments, kde=False)
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x23f88a36550>

Visualization of Clusters

In [51]:
# Simple Cluster Visualization
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
plt.figure(figsize=(7, 5))
sns.scatterplot(x=pca_heart[:, 0], y=pca_heart[:, 1], hue=cluster_assignments, style=cluster_assignments)
plt.xlabel('PCA Dim 1')
plt.ylabel('PCA Dim 2')
plt.title('Clusters Visualized')
# Label only the clusters that the fitted model actually contains
plt.legend([f'Cluster: {i+1}' for i in range(kmeans_mod.n_clusters)])
Out[51]:
<matplotlib.legend.Legend at 0x23f8a156da0>
In [52]:
# Cluster Visualization based on Sleep Efficiency
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
plt.figure(figsize=(7, 5))
sns.scatterplot(x=pca_heart[:, 0], y=pca_heart[:, 1], hue=final_sleep_labels, style=cluster_assignments)
plt.xlabel('PCA Dim 1')
plt.ylabel('PCA Dim 2')
plt.title('Clusters Visualized')
plt.legend([])
Out[52]:
<matplotlib.legend.Legend at 0x23f8a203550>
In [53]:
fig, ax = plt.subplots(1, 2, figsize=(15, 7))

# Simple Cluster Visualization
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
sns.scatterplot(x=pca_heart[:, 0], y=pca_heart[:, 1], hue=cluster_assignments, style=cluster_assignments, ax=ax[0])
ax[0].set_xlabel('PCA Dim 1')
ax[0].set_ylabel('PCA Dim 2')
ax[0].set_title('Clusters Visualized')
# Label only the clusters that the fitted model actually contains
ax[0].legend([f'Cluster: {i+1}' for i in range(kmeans_mod.n_clusters)])

# Cluster Visualization based on Sleep Efficiency
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
sns.scatterplot(x=pca_heart[:, 0], y=pca_heart[:, 1], hue=final_sleep_labels, style=cluster_assignments, ax=ax[1])
ax[1].set_xlabel('PCA Dim 1')
ax[1].set_ylabel('PCA Dim 2')
ax[1].set_title('Clusters Visualized')
ax[1].legend([])
Out[53]:
<matplotlib.legend.Legend at 0x23f8a26bdd8>

Cluster Purity

Finding cluster purity based on the sleep labels

In [54]:
# Clustering Purity is defined by ratio of dominant class of sleep label instance in the cluster 
# to total number of instances in the cluster
for master_cluster_num in range(len(kmeans_mod.cluster_centers_)):
    cluster_sleep_labels = final_sleep_labels[cluster_assignments == master_cluster_num]
    pos_sleep_label_purity = sum(cluster_sleep_labels) / cluster_sleep_labels.shape[0]
    print(f'Cluster Number: {master_cluster_num}, Purity:', max(pos_sleep_label_purity, 1 - pos_sleep_label_purity))
Cluster Number: 0, Purity: 0.6534090909090909
Cluster Number: 1, Purity: 0.7708333333333334
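
The purity definition used above can be wrapped in a small reusable function. The sketch below is illustrative; the name `cluster_purity` and the dictionary return type are assumptions rather than part of the project's utilities.

```python
import numpy as np

def cluster_purity(labels, assignments):
    """Map each cluster id to the share of its dominant binary label."""
    labels = np.asarray(labels, dtype=bool)
    assignments = np.asarray(assignments)
    purity = {}
    for cluster_id in np.unique(assignments):
        # Fraction of good-sleep nights among the nights in this cluster
        positive_share = labels[assignments == cluster_id].mean()
        purity[int(cluster_id)] = float(max(positive_share, 1 - positive_share))
    return purity

demo_labels = [True, True, False, True, False, False, False]
demo_assignments = [0, 0, 0, 0, 1, 1, 1]
print(cluster_purity(demo_labels, demo_assignments))  # {0: 0.75, 1: 1.0}
```

With two classes, purity ranges from 0.5 (an even mix) to 1.0 (a single class), so the values of roughly 0.65 and 0.77 above indicate clusters that only partly separate good from poor sleep.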
In [37]:
# Constructing a histogram plot to visualize the sleep label purity in every cluster.
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(np.array(final_sleep_labels[cluster_assignments==0], dtype=np.int16), ax = ax[0, 0], kde=False)
ax[0, 0].set_xlabel('Good Sleep?')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('Cluster 1')

sns.distplot(np.array(final_sleep_labels[cluster_assignments==1], dtype=np.int16), ax = ax[0, 1], kde=False)
ax[0, 1].set_xlabel('Good Sleep?')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('Cluster 2')

sns.distplot(np.array(final_sleep_labels[cluster_assignments==2], dtype=np.int16), ax = ax[1, 0], kde=False)
ax[1, 0].set_xlabel('Good Sleep?')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('Cluster 3')

sns.distplot(np.array(final_sleep_labels[cluster_assignments==3], dtype=np.int16), ax = ax[1, 1], kde=False)
ax[1, 1].set_xlabel('Good Sleep?')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('Cluster 4')
C:\Users\Saksham\Anaconda3\lib\site-packages\seaborn\distributions.py:198: RuntimeWarning: Mean of empty slice.
  line, = ax.plot(a.mean(), 0)
C:\Users\Saksham\Anaconda3\lib\site-packages\numpy\core\_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Out[37]:
Text(0.5, 1.0, 'Cluster 4')

Activity Histograms for Clusters

Cluster: 1

In [38]:
# Constructing a histogram plot for the different activity level percentages of all days in this cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==0), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==0), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==0), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==0), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[38]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [39]:
# Constructing a histogram plot for the different activity level percentages visualizing with respect to the good sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[39]:
<matplotlib.legend.Legend at 0x211a8fb86a0>

Cluster: 2

In [40]:
# Constructing a histogram plot for the different activity level percentages of all days in this cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==1), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==1), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==1), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==1), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[40]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [41]:
# Constructing a histogram plot for the different activity level percentages visualizing with respect to the good sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[41]:
<matplotlib.legend.Legend at 0x211a959c470>

Cluster: 3

In [42]:
# Constructing a histogram plot for the different activity level percentages of all days in this cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==2), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==2), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==2), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==2), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
C:\Users\Saksham\Anaconda3\lib\site-packages\numpy\core\_methods.py:85: RuntimeWarning: invalid value encountered in true_divide
  ret = ret.dtype.type(ret / rcount)
C:\Users\Saksham\Anaconda3\lib\site-packages\numpy\lib\histograms.py:823: RuntimeWarning: invalid value encountered in true_divide
  return n/db/n.sum(), bin_edges
Out[42]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [43]:
# Constructing a histogram plot for the different activity level percentages visualizing with respect to the good sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[43]:
<matplotlib.legend.Legend at 0x211a9d7d4a8>

Cluster: 4

In [44]:
# Constructing a histogram plot for the different activity level percentages of all days in this cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==3), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==3), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==3), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==3), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[44]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [45]:
# Constructing a histogram plot for the different activity level percentages visualizing with respect to the good sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[45]:
<matplotlib.legend.Legend at 0x211aa83ea58>

Sub-Clustering on Activity Data

In [55]:
sub_clusters = activity_percentage_clusterer(KMeans(n_clusters=num_activity_clusters), cluster_assignments, activity_percentages)
In [56]:
# Sanity Check for the number of points in each cluster
print(np.unique(cluster_assignments, return_counts=True))
for sub_cluster in sub_clusters:
    print(sub_cluster.shape)
(array([0, 1]), array([176,  96], dtype=int64))
(176,)
(96,)
Cluster Purity in each subcluster
In [57]:
# The value printed as "Purity" below is the fraction p of good-sleep labels in each sub cluster.
# Dominant-class purity would be max(p, 1 - p).
for index, sub_cluster in enumerate(sub_clusters):
    print('Master Cluster:', index+1)
    cluster_sleep_labels = final_sleep_labels[(cluster_assignments == index)]
    for sub_cluster_assignment in range(num_activity_clusters):
        sub_cluster_sleep_labels = cluster_sleep_labels[(sub_cluster==sub_cluster_assignment)]
        try:
            pos_sleep_label_purity = sum(sub_cluster_sleep_labels) / sub_cluster_sleep_labels.shape[0]
            print(f'Sub Cluster Number: {sub_cluster_assignment}, Purity:', pos_sleep_label_purity)
        except ZeroDivisionError:
            # Raised when no points were assigned to this sub cluster
            print(f'Sub Cluster Number: {sub_cluster_assignment}, No Points assigned')
Master Cluster: 1
Sub Cluster Number: 0, Purity: 0.6538461538461539
Sub Cluster Number: 1, Purity: 0.6153846153846154
Sub Cluster Number: 2, Purity: 0.6491228070175439
Sub Cluster Number: 3, Purity: 0.6785714285714286
Master Cluster: 2
Sub Cluster Number: 0, Purity: 0.22916666666666666
Sub Cluster Number: 1, Purity: 0.1875
Sub Cluster Number: 2, Purity: 0.14285714285714285
Sub Cluster Number: 3, Purity: 0.4444444444444444
In [58]:
sleep_recipes = get_good_sleep_recipes(cluster_assignments, sub_clusters, activity_percentages, final_sleep_labels)
sleep_recipes
Cluster: 0, Sub Cluster: 3, Good Ratio: 2.111111111111111
Out[58]:
array([[65.8  , 31.2  ,  1.502,  1.508]], dtype=float16)
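The "Good Ratio" of 2.111 printed above is numerically consistent with the good-to-poor odds p / (1 - p) for the sub cluster whose good-sleep fraction is 0.6786. This is an inference from the printed numbers, not something confirmed by the source of get_good_sleep_recipes. A minimal sketch of that arithmetic, where the helper name good_to_poor_ratio and the counts 19 of 28 are assumptions chosen to reproduce the printed fraction:

```python
from fractions import Fraction


def good_to_poor_ratio(good_fraction):
    """Odds of good versus poor sleep, given the fraction of good-sleep records."""
    return good_fraction / (1 - good_fraction)


# 19 good records out of 28 gives the printed fraction 0.6785714285714286.
# The counts are assumed for illustration; only the fraction appears in the output.
good_fraction = Fraction(19, 28)
ratio = good_to_poor_ratio(good_fraction)

print(float(good_fraction))  # fraction of good-sleep records
print(ratio, float(ratio))   # exact odds and their decimal value
```

Under this reading, a ratio above 1 means good-sleep records outnumber poor-sleep records in the sub cluster.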
In [59]:
# Draw one bar chart per recipe row; sleep_recipes holds a single row here,
# so iterating avoids indexing past the end of the array
for recipe_index, recipe in enumerate(sleep_recipes / 1440 * 100):
    plt.figure(recipe_index)
    plt.bar(['S', 'L', 'M', 'V'], recipe)

K-Means - DTW

Here we apply K-Means to the data with Dynamic Time Warping (DTW) as the distance metric.
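To make the distance metric concrete, here is a minimal pure-Python sketch of a DTW distance between two one-dimensional series. It assumes a squared-difference local cost and a square root of the accumulated cost, which is a common convention. The function name dtw_distance is hypothetical, and the notebook itself relies on tslearn rather than this code.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.

    cost[i][j] holds the cheapest accumulated squared difference for
    aligning the first i points of a with the first j points of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor alignments
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** 0.5


# The second series repeats its first value, so it is a time-stretched copy of the first.
print(dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3]))
# The second series is the first one shifted up by one unit.
print(dtw_distance([1, 2, 3], [2, 3, 4]))
```

Because the alignment may repeat points, series of different lengths or with shifted timing can still be compared, which is the motivation for using DTW on heart rate trends instead of a point-by-point Euclidean distance.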

In [60]:
num_activity_clusters = 2

Fitting the Model

In [62]:
clusterer = get_best_clustering_model(lambda num_clusters: TimeSeriesKMeans(num_clusters, metric='dtw', max_iter=50), 
                                       reduced_heart_trends, cluster_range=range(2, 3))
3362.546 --> 1799.753 --> 1775.146 --> 1742.935 --> 1729.964 --> 1690.833 --> 1663.147 --> 1662.270 --> 1662.270 --> 
2485.022 --> 1569.525 --> 1525.967 --> 1499.003 --> 1498.005 --> 1497.934 --> 1497.934 --> 
1890.939 --> 1433.754 --> 1389.436 --> 1362.015 --> 1357.174 --> 1356.049 --> 1354.176 --> 1351.909 --> 1351.123 --> 1350.591 --> 1350.325 --> 1350.048 --> 1349.909 --> 1349.909 --> 
1832.166 --> 1359.409 --> 1318.002 --> 1302.845 --> 1297.364 --> 
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
~\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in run_code(self, code_obj, result, async_)
   3266                 else:
-> 3267                     exec(code_obj, self.user_global_ns, self.user_ns)
   3268             finally:

<ipython-input-62-54d192905ea7> in <module>
      1 clusterer = get_best_clustering_model(lambda num_clusters: TimeSeriesKMeans(num_clusters, metric='dtw', max_iter=50), 
----> 2                                        reduced_heart_trends, cluster_range=range(2, 3))

D:\GIT\healthRecSys\src\clustering_utils.py in get_best_clustering_model(cluster_model_getter, data, cluster_range, sil_score_distance_metric)
     32                                               range(list(cluster_range)[-0], list(cluster_range)[-1] + 5),
---> 33 			                      sil_score_distance_metric)
     34         return cluster_model_getter(best_number_clusters)

D:\GIT\healthRecSys\src\clustering_utils.py in get_best_num_clusters(orig_data, cluster_getter_func, number_of_cluster_range, sil_score_distance_metric)
     15                 clusterer = cluster_getter_func(num_clusters)
---> 16                 cluster_labels = clusterer.fit_predict(orig_data)
     17                 sil_score = silhouette_score(orig_data, cluster_labels, metric=sil_score_distance_metric)

~\Anaconda3\lib\site-packages\tslearn\clustering.py in fit_predict(self, X, y)
    643         """
--> 644         return self.fit(X, y).labels_
    645 

~\Anaconda3\lib\site-packages\tslearn\clustering.py in fit(self, X, y)
    617                 n_attempts += 1
--> 618                 self._fit_one_init(X_, x_squared_norms, rs)
    619                 if self.inertia_ < min_inertia:

~\Anaconda3\lib\site-packages\tslearn\clustering.py in _fit_one_init(self, X, x_squared_norms, rs)
    545                 print("%.3f" % self.inertia_, end=" --> ")
--> 546             self._update_centroids(X)
    547 

~\Anaconda3\lib\site-packages\tslearn\clustering.py in _update_centroids(self, X)
    582                                                                     init_barycenter=self.cluster_centers_[k],
--> 583                                                                     verbose=False)
    584                     # DTWBarycenterAveraging(max_iter=self.max_iter_barycenter,

~\Anaconda3\lib\site-packages\tslearn\barycenters.py in dtw_barycenter_averaging(X, barycenter_size, init_barycenter, max_iter, tol, weights, verbose)
    326     for it in range(max_iter):
--> 327         assign = _petitjean_assignment(X_, barycenter)
    328         cost = _petitjean_cost(X_, barycenter, assign, weights)

~\Anaconda3\lib\site-packages\tslearn\barycenters.py in _petitjean_assignment(X, barycenter)
    239     for i in range(n):
--> 240         path, _ = dtw_path(X[i], barycenter)
    241         for pair in path:

~\Anaconda3\lib\site-packages\tslearn\metrics.py in dtw_path(s1, s2, global_constraint, sakoe_chiba_radius)
     79         return cydtw_path(s1, s2, mask=itakura_mask(sz1, sz2))
---> 80     return cydtw_path(s1, s2, mask=numpy.zeros((sz1, sz2)))
     81 

KeyboardInterrupt: 

In [ ]:
clusterer
In [ ]:
clusterer.labels_
In [65]:
%%time
# Fit the selected model and score the clustering (silhouette_score uses Euclidean distance by default)
clusterer.fit(reduced_heart_trends)
cluster_assignments = clusterer.labels_
sil_score = silhouette_score(reduced_heart_trends, cluster_assignments)
print(clusterer.n_clusters, sil_score)
np.unique(cluster_assignments, return_counts=True)
2 0.18196322638456272
Wall time: 5.85 ms
In [66]:
print(np.unique(cluster_assignments, return_counts=True))
(array([0, 1], dtype=int64), array([107, 165], dtype=int64))
In [67]:
# Update the number of activity clusters based on the minimum amount of records assigned to a cluster
num_activity_clusters = min(num_activity_clusters, *(np.unique(cluster_assignments, return_counts=True)[1]))
print('Updated Number of activity clusters:', num_activity_clusters)
Updated Number of activity clusters: 8
In [68]:
# Visualizing the number of points in each cluster
sns.distplot(cluster_assignments, kde=False)
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x2adc2374f98>

Visualization of Clusters

In [69]:
# Simple Cluster Visualization
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
plt.figure(figsize=(7, 5))
sns.scatterplot(pca_heart[:, 0], pca_heart[:, 1], hue=cluster_assignments, style=cluster_assignments)
plt.xlabel('PCA Dim 1')
plt.ylabel('PCA Dim 2')
plt.title('Clusters Visualized')
plt.legend([f'Cluster: {i+1}' for i in range(clusterer.n_clusters)])
Out[69]:
<matplotlib.legend.Legend at 0x2adc22abe48>
In [70]:
# Cluster Visualization based on Sleep Efficiency
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
plt.figure(figsize=(7, 5))
sns.scatterplot(pca_heart[:, 0], pca_heart[:, 1], hue=final_sleep_labels, style=cluster_assignments)
plt.xlabel('PCA Dim 1')
plt.ylabel('PCA Dim 2')
plt.title('Clusters Visualized')
plt.legend([])
Out[70]:
<matplotlib.legend.Legend at 0x2adc3d7cf60>
In [71]:
fig, ax = plt.subplots(1, 2, figsize=(15, 7))

# Simple Cluster Visualization
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
sns.scatterplot(pca_heart[:, 0], pca_heart[:, 1], hue=cluster_assignments, style=cluster_assignments, ax=ax[0])
ax[0].set_xlabel('PCA Dim 1')
ax[0].set_ylabel('PCA Dim 2')
ax[0].set_title('Clusters Visualized')
ax[0].legend([f'Cluster: {i+1}' for i in range(clusterer.n_clusters)])

# Cluster Visualization based on Sleep Efficiency (reuses the PCA projection computed above)
sns.scatterplot(pca_heart[:, 0], pca_heart[:, 1], hue=final_sleep_labels, style=cluster_assignments, ax=ax[1])
ax[1].set_xlabel('PCA Dim 1')
ax[1].set_ylabel('PCA Dim 2')
ax[1].set_title('Clusters Visualized')
ax[1].legend([])
Out[71]:
<matplotlib.legend.Legend at 0x2adc3df4e48>

Cluster Purity

Finding cluster purity based on the sleep labels
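As a small illustration of the purity measure used in this section, the sketch below computes, for each cluster, the share of the dominant sleep label. The function name cluster_purity and the toy labels are assumptions made for the example and do not come from the notebook's data.

```python
def cluster_purity(labels, assignments):
    """Return {cluster_id: purity}, where purity is the share of the
    dominant boolean label among the records assigned to that cluster."""
    purities = {}
    for cluster_id in sorted(set(assignments)):
        members = [label for label, c in zip(labels, assignments) if c == cluster_id]
        good_fraction = sum(members) / len(members)
        # The dominant class is whichever of good / poor sleep is more frequent
        purities[cluster_id] = max(good_fraction, 1 - good_fraction)
    return purities


# Toy data: True marks a good-sleep record, False a poor-sleep record.
toy_labels = [True, True, False, True, False, False, False, True]
toy_assignments = [0, 0, 0, 0, 1, 1, 1, 1]
print(cluster_purity(toy_labels, toy_assignments))
```

A purity of 0.5 means the cluster is evenly mixed, while a value close to 1 means the cluster is dominated by one sleep label.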

In [74]:
# Clustering Purity is defined by ratio of dominant class of sleep label instance in the cluster 
# to total number of instances in the cluster
for master_cluster_num in np.unique(cluster_assignments):
    cluster_sleep_labels = final_sleep_labels[cluster_assignments == master_cluster_num]
    pos_sleep_label_purity = sum(cluster_sleep_labels) / cluster_sleep_labels.shape[0]
    print(f'Cluster Number: {master_cluster_num}, Purity:', max(pos_sleep_label_purity, 1 - pos_sleep_label_purity))
Cluster Number: 0, Purity: 0.719626168224299
Cluster Number: 1, Purity: 0.6484848484848484
In [96]:
# Constructing a histogram of the sleep labels in each cluster to visualize cluster purity.
# Only the clusters that actually exist are plotted, so no empty slices are passed to seaborn.
cluster_ids = np.unique(cluster_assignments)
fig, ax = plt.subplots(1, len(cluster_ids), figsize=(15, 5), squeeze=False)
for axis, cluster_id in zip(ax[0], cluster_ids):
    sns.distplot(np.array(final_sleep_labels[cluster_assignments==cluster_id], dtype=np.int16), ax = axis, kde=False)
    axis.set_xlabel('Good Sleep?')
    axis.set_ylabel('Frequency')
    axis.set_title(f'Cluster {cluster_id + 1}')

Activity Histograms for Clusters

Cluster: 1

In [65]:
# Constructing a histogram plot of the different activity level percentages for all records in the cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==0), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==0), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==0), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==0), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[65]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [66]:
# Constructing a histogram plot for the different activity level percentages visualizing with respect to the good sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[66]:
<matplotlib.legend.Legend at 0x11207917e10>

Cluster: 2

In [67]:
# Constructing a histogram plot of the different activity level percentages for all records in the cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==1), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==1), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==1), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==1), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[67]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [68]:
# Constructing a histogram plot for the different activity level percentages visualizing with respect to the good sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[68]:
<matplotlib.legend.Legend at 0x11207e62ef0>

Cluster: 3

In [69]:
# Constructing a histogram plot of the different activity level percentages for all records in the cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==2), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==2), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==2), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==2), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[69]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [70]:
# Constructing a histogram plot for the different activity level percentages visualizing with respect to the good sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[70]:
<matplotlib.legend.Legend at 0x112095ebd68>

Cluster: 4

In [71]:
# Constructing a histogram plot of the different activity level percentages for all records in the cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==3), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==3), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==3), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==3), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[71]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [72]:
# Constructing a histogram plot for the different activity level percentages visualizing with respect to the good sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[72]:
<matplotlib.legend.Legend at 0x1120a2209e8>

Sub-Clustering on Activity Data

In [75]:
sub_clusters = activity_percentage_clusterer(TimeSeriesKMeans(num_activity_clusters, metric='dtw', max_iter=50), cluster_assignments, activity_percentages)
2024.084 --> 1633.619 --> 1566.092 --> 1555.393 --> 1555.393 --> 
1136.539 --> 958.523 --> 872.808 --> 842.940 --> 820.857 --> 811.949 --> 806.796 --> 799.560 --> 794.643 --> 789.039 --> 781.425 --> 777.462 --> 774.717 --> 768.324 --> 757.301 --> 751.612 --> 749.775 --> 748.059 --> 747.488 --> 746.872 --> 745.744 --> 745.426 --> 745.426 --> 
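
The sub-clustering call above passes `metric='dtw'`. As a rough illustration of what dynamic time warping measures, here is a minimal NumPy sketch. `dtw_distance` is a hypothetical helper written for this note, not the tslearn implementation; it assumes the common convention of accumulating squared differences along the cheapest alignment path and taking the square root at the end.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences (illustrative sketch)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # acc[i, j] holds the minimal accumulated squared cost of aligning a[:i] with b[:j]
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(np.sqrt(acc[n, m]))

# Same shape, one sample repeated: warping absorbs the repetition
same_shape_stretched = dtw_distance([0, 1, 2], [0, 0, 1, 2])
# Reversed shape: no warping can align it for free
reversed_shape = dtw_distance([0, 1, 2], [2, 1, 0])
```

Because the alignment may repeat samples, two series with the same shape but different pacing end up close together, which is the property a DTW-based clusterer relies on.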
In [76]:
# Sanity Check for the number of points in each cluster
print(np.unique(cluster_assignments, return_counts=True))
for sub_cluster in sub_clusters:
    print(sub_cluster.shape)
(array([0, 1], dtype=int64), array([107, 165], dtype=int64))
(107,)
(165,)

Cluster Purity in each subcluster

In [78]:
# Cluster purity: the fraction of records in a (sub-)cluster that carry
# the dominant sleep label of that (sub-)cluster
for index, sub_cluster in enumerate(sub_clusters):
    print('Master Cluster:', index+1)
    cluster_sleep_labels = final_sleep_labels[(cluster_assignments == index)]
    for sub_cluster_assignment in range(num_activity_clusters):
        sub_cluster_sleep_labels = cluster_sleep_labels[(sub_cluster==sub_cluster_assignment)]
        try:
            pos_sleep_label_purity = sum(sub_cluster_sleep_labels) / sub_cluster_sleep_labels.shape[0]
            print(f'Sub Cluster Number: {sub_cluster_assignment}, Purity:', max(pos_sleep_label_purity, 1 - pos_sleep_label_purity))
        except ZeroDivisionError:  # raised when no points fall in this sub-cluster
            print(f'Sub Cluster Number: {sub_cluster_assignment}, No Points assigned')
Master Cluster: 1
Sub Cluster Number: 0, Purity: 0.8636363636363636
Sub Cluster Number: 1, Purity: 0.5652173913043479
Sub Cluster Number: 2, Purity: 0.6666666666666666
Sub Cluster Number: 3, Purity: 0.6666666666666667
Sub Cluster Number: 4, Purity: 0.9032258064516129
Sub Cluster Number: 5, Purity: 0.5714285714285714
Sub Cluster Number: 6, Purity: 0.5
Sub Cluster Number: 7, Purity: 1.0
Master Cluster: 2
Sub Cluster Number: 0, Purity: 0.7
Sub Cluster Number: 1, Purity: 0.6551724137931034
Sub Cluster Number: 2, Purity: 0.6666666666666666
Sub Cluster Number: 3, Purity: 0.5714285714285714
Sub Cluster Number: 4, Purity: 0.6428571428571429
Sub Cluster Number: 5, Purity: 0.7222222222222222
Sub Cluster Number: 6, Purity: 0.64
Sub Cluster Number: 7, Purity: 0.5806451612903226
In [79]:
sleep_recipes = get_good_sleep_recipes(cluster_assignments, sub_clusters, activity_percentages, final_sleep_labels)
sleep_recipes
Cluster: 0, Sub Cluster: 2, Good Ratio: 2.0
Cluster: 1, Sub Cluster: 0, Good Ratio: 2.3333333333333335
Cluster: 1, Sub Cluster: 2, Good Ratio: 2.0
Cluster: 1, Sub Cluster: 5, Good Ratio: 2.6
Out[79]:
array([[1408.5      ,   26.06836  ,    2.520703 ,    3.0234375],
       [1150.457    ,  277.61786  ,    5.205134 ,    6.7181087],
       [1215.       ,  117.506256 ,   28.005468 ,   79.48125  ],
       [1226.1808   ,  202.0673   ,    6.651293 ,    4.923566 ]],
      dtype=float32)

K-Means - KL Divergence

Here we apply K-Means to the data, using a symmetrised K-L divergence (the average of the divergence in both directions) as the distance measure

Defining the distance function using the K-L Divergence

In [63]:
def k_l_distance(x, y):
    """Symmetrised K-L divergence: the mean of KL(x || y) and KL(y || x)."""
    return (entropy(x, y) + entropy(y, x)) / 2
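
To make the behaviour of this distance concrete, here is a small self-contained sketch. The function name `symmetric_kl` and the toy vectors are illustrative assumptions; the only library call is `scipy.stats.entropy`, which returns the K-L divergence when given two arguments and normalises both inputs to sum to 1.

```python
import numpy as np
from scipy.stats import entropy

def symmetric_kl(p, q):
    """Mean of KL(p || q) and KL(q || p); inputs are treated as unnormalised distributions."""
    return (entropy(p, q) + entropy(q, p)) / 2

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

d_pq = symmetric_kl(p, q)
d_qp = symmetric_kl(q, p)            # same value: the average is symmetric
d_scaled = symmetric_kl(10 * p, q)   # same value: entropy() normalises its inputs
d_self = symmetric_kl(p, p)          # zero for identical distributions

# A zero entry in one vector where the other is positive makes the divergence infinite
d_with_zero = symmetric_kl(np.array([1.0, 0.0, 0.0]), q)
```

Two practical consequences follow from this sketch: the inputs should be non-negative, and vectors containing exact zeros may need smoothing before they are compared, otherwise the distance can become infinite.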
In [54]:
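# Pairwise symmetrised K-L distances between all records (cdist taken from scipy.spatial.distance)
kl_dist_heart = distance.cdist(ordered_heart_trends, ordered_heart_trends, metric=k_l_distance)
kl_dist_calories = distance.cdist(ordered_calories_trends, ordered_calories_trends, metric=k_l_distance)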
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.heatmap(kl_dist_heart, xticklabels=10, yticklabels=10, ax=ax[0])
ax[0].set_title('All Sleep K-L Divergence Cross Matrix for Heart Trends')
sns.heatmap(kl_dist_calories, xticklabels=10, yticklabels=10, ax=ax[1])
ax[1].set_title('All Sleep K-L Divergence Cross Matrix for Calories Trends')
Out[54]:
Text(0.5, 1.0, 'All Sleep K-L Divergence Cross Matrix for Calories Trends')

Best Model

In [64]:
kl_best_mod = get_best_clustering_model(lambda num_clusters: KL_Kmeans(num_clusters), reduced_heart_trends, 
                                        sil_score_distance_metric=k_l_distance)
kmeans: X (272, 48)  centres (2, 48)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8A229198>>
kmeans: 7 iterations  cluster sizes: [115 157]
kmeans: X (272, 48)  centres (3, 48)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8A229A20>>
kmeans: 6 iterations  cluster sizes: [130  54  88]
kmeans: X (272, 48)  centres (4, 48)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8A229198>>
kmeans: 11 iterations  cluster sizes: [54 50 85 83]
kmeans: X (272, 48)  centres (5, 48)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8A229A20>>
kmeans: 8 iterations  cluster sizes: [31 66 47 58 70]
kmeans: X (272, 48)  centres (6, 48)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8A229198>>
kmeans: 8 iterations  cluster sizes: [51 67 37 37 43 37]
kmeans: X (272, 48)  centres (7, 48)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8A229A20>>
kmeans: 14 iterations  cluster sizes: [60 48 55  9 28 39 33]
kmeans: X (272, 48)  centres (8, 48)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8A229198>>
kmeans: 7 iterations  cluster sizes: [31 34  1 24 46 53 27 56]
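
The internals of `get_best_clustering_model` are not shown in this notebook. As a hedged sketch of the general idea of picking the number of clusters by silhouette score, the example below swaps in scikit-learn's Euclidean `KMeans` on synthetic blobs instead of `KL_Kmeans` on the heart-rate trends; `pick_k_by_silhouette` and the demo data are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(X, k_values, random_state=0):
    """Fit one model per candidate k and return the k with the highest silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Three well-separated synthetic blobs stand in for the real data
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
best_k, scores = pick_k_by_silhouette(X_demo, range(2, 9))
```

The same loop works with a custom distance by passing `metric=` to `silhouette_score`, as the fitting cell in this section does with `k_l_distance`.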

Fitting the Model

In [65]:
# Set the seed so that we get the same clustering every time
# random.seed(2)
# np.random.seed(1000)
# Performing the Clustering
# randomcentres = randomsample(reduced_heart_trends, kl_best_mod.get_num_clusters())
randomcentres = randomsample(reduced_heart_trends, 4)
centres, cluster_assignments, dist = kmeans(reduced_heart_trends, randomcentres, metric=k_l_distance, maxiter=200)
sil_score = silhouette_score(reduced_heart_trends, cluster_assignments, metric=k_l_distance)
print(len(centres), sil_score)
np.unique(cluster_assignments, return_counts=True)
kmeans: X (272, 48)  centres (4, 48)  delta=0.001  maxiter=200  metric=<function k_l_distance at 0x0000023F8D289950>
kmeans: 7 iterations  cluster sizes: [58 69 72 73]
4 0.19436746816575762
Out[65]:
(array([0, 1, 2, 3], dtype=int64), array([58, 69, 72, 73], dtype=int64))
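
The `kmeans` function used above comes from the project's `kmeans_dm` module, whose source is not shown here. The following is only a minimal sketch of a Lloyd-style loop with a pluggable distance, under the assumption that points are assigned with the custom metric and centres are updated as plain means. `kmeans_with_metric`, `sym_kl` and the Dirichlet demo data are hypothetical names and inputs.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import entropy
from sklearn.metrics import silhouette_score

def sym_kl(x, y):
    """Symmetrised K-L divergence between two non-negative vectors."""
    return (entropy(x, y) + entropy(y, x)) / 2

def kmeans_with_metric(X, init_centres, metric, max_iter=100):
    """Assign points to the nearest centre under `metric`, then recompute centres as means."""
    centres = np.array(init_centres, dtype=float)
    labels = None
    for _ in range(max_iter):
        new_labels = cdist(X, centres, metric=metric).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(len(centres)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster empties out
                centres[j] = members.mean(axis=0)
    return centres, labels

# Two groups of strictly positive probability vectors with clearly different shapes
rng = np.random.default_rng(0)
group_a = rng.dirichlet([20, 2, 2], size=30)
group_b = rng.dirichlet([2, 2, 20], size=30)
X_demo = np.vstack([group_a, group_b])

centres, labels = kmeans_with_metric(X_demo, X_demo[[0, 30]], metric=sym_kl)
score = silhouette_score(X_demo, labels, metric=sym_kl)
```

Passing the same callable to both the clustering loop and `silhouette_score` keeps the quality measure consistent with the distance the clusters were built on.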
In [66]:
# Cap the number of activity clusters at the size of the smallest master cluster
num_activity_clusters = min(num_activity_clusters, *(np.unique(cluster_assignments, return_counts=True)[1]))
print('Updated Number of activity clusters:', num_activity_clusters)
Updated Number of activity clusters: 2
In [67]:
# Visualizing the number of points in each cluster
sns.distplot(cluster_assignments, kde=False)
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x23f8a30fc88>

Visualization of Clusters

In [68]:
# Simple Cluster Visualization
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
plt.figure(figsize=(7, 5))
sns.scatterplot(x=pca_heart[:, 0], y=pca_heart[:, 1], hue=cluster_assignments, style=cluster_assignments)
plt.xlabel('PCA Dim 1')
plt.ylabel('PCA Dim 2')
plt.title('Clusters Visualized')
plt.legend([f'Cluster: {i+1}' for i in range(4)])
Out[68]:
<matplotlib.legend.Legend at 0x23f8d30b668>
In [69]:
# Cluster Visualization based on Sleep Efficiency
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
plt.figure(figsize=(7, 5))
sns.scatterplot(x=pca_heart[:, 0], y=pca_heart[:, 1], hue=final_sleep_labels, style=cluster_assignments)
plt.xlabel('PCA Dim 1')
plt.ylabel('PCA Dim 2')
plt.title('Clusters Visualized')
plt.legend([])
Out[69]:
<matplotlib.legend.Legend at 0x23f8d530cf8>
In [70]:
fig, ax = plt.subplots(1, 2, figsize=(15, 7))

# Simple Cluster Visualization
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
sns.scatterplot(x=pca_heart[:, 0], y=pca_heart[:, 1], hue=cluster_assignments, style=cluster_assignments, ax=ax[0])
# sns.scatterplot(pca_heart[:, 0], pca_heart[:, 1], hue=cluster_assignments, size=cluster_assignments, ax=ax[0])
ax[0].set_xlabel('PCA Dim 1')
ax[0].set_ylabel('PCA Dim 2')
ax[0].set_title('Clusters Visualized')
ax[0].legend([f'Cluster: {i+1}' for i in range(4)])

# Cluster Visualization based on Sleep Efficiency
pca_mod = PCA(2)
pca_heart = pca_mod.fit_transform(reduced_heart_trends)
sns.scatterplot(x=pca_heart[:, 0], y=pca_heart[:, 1], hue=final_sleep_labels, style=cluster_assignments, ax=ax[1])
# sns.scatterplot(pca_heart[:, 0], pca_heart[:, 1], hue=final_sleep_labels, size=cluster_assignments, ax=ax[1])
ax[1].set_xlabel('PCA Dim 1')
ax[1].set_ylabel('PCA Dim 2')
ax[1].set_title('Clusters Visualized')
ax[1].legend([])
Out[70]:
<matplotlib.legend.Legend at 0x23f8d5b2a90>
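
The scatter plots above show only the first two principal components, so it can help to check how much variance that projection retains before reading too much into the picture. The sketch below uses random stand-in data with the same shape as `reduced_heart_trends` (272 x 48); the data and variable names are assumptions, and only scikit-learn's `PCA` is used.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in with the same shape as the heart-rate trend matrix
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(272, 48))

pca = PCA(n_components=2)
coords = pca.fit_transform(X_demo)  # one 2-D point per record, ready for a scatter plot

# Fraction of the total variance captured by the two plotted dimensions
explained = float(pca.explained_variance_ratio_.sum())
```

A low value of `explained` suggests that clusters overlapping in the 2-D plot may still be separated in the original space. Since the input does not change between the plots, the projection can be computed once and reused for each of them.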

Cluster Purity

Finding cluster purity based on the sleep labels

In [71]:
# Cluster purity: the fraction of records in a cluster that carry
# the dominant sleep label of that cluster
for master_cluster_num in range(len(centres)):
    cluster_sleep_labels = final_sleep_labels[cluster_assignments == master_cluster_num]
    pos_sleep_label_purity = sum(cluster_sleep_labels) / cluster_sleep_labels.shape[0]
    print(f'Cluster Number: {master_cluster_num}, Purity:', max(pos_sleep_label_purity, 1 - pos_sleep_label_purity))
Cluster Number: 0, Purity: 0.7931034482758621
Cluster Number: 1, Purity: 0.6376811594202898
Cluster Number: 2, Purity: 0.7638888888888888
Cluster Number: 3, Purity: 0.6438356164383562
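
The purity computation above can be wrapped in a small reusable function. This is a sketch on toy data: `cluster_purity` and the toy arrays are hypothetical, and the labels are assumed to be boolean (good sleep or not), as they are used elsewhere in this notebook.

```python
import numpy as np

def cluster_purity(labels, assignments):
    """Return {cluster_id: share of the dominant boolean label within that cluster}."""
    labels = np.asarray(labels, dtype=bool)
    assignments = np.asarray(assignments)
    purities = {}
    for cluster_id in np.unique(assignments):
        members = labels[assignments == cluster_id]
        good_fraction = members.mean()  # share of good-sleep records
        purities[int(cluster_id)] = float(max(good_fraction, 1 - good_fraction))
    return purities

# Cluster 0 holds 3 good and 1 poor record; cluster 1 holds 4 poor records
toy_labels = [True, True, True, False, False, False, False, False]
toy_assignments = [0, 0, 0, 0, 1, 1, 1, 1]
purity = cluster_purity(toy_labels, toy_assignments)
```

With two classes, purity ranges from 0.5 (an even mix) to 1.0 (a single label), so values close to 0.5 indicate that the cluster says little about sleep quality.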
In [72]:
# Histograms of the good-sleep label within each cluster, to visualize cluster purity
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(np.array(final_sleep_labels[cluster_assignments==0], dtype=np.int16), ax = ax[0, 0], kde=False)
ax[0, 0].set_xlabel('Good Sleep?')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('Cluster 1')

sns.distplot(np.array(final_sleep_labels[cluster_assignments==1], dtype=np.int16), ax = ax[0, 1], kde=False)
ax[0, 1].set_xlabel('Good Sleep?')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('Cluster 2')

sns.distplot(np.array(final_sleep_labels[cluster_assignments==2], dtype=np.int16), ax = ax[1, 0], kde=False)
ax[1, 0].set_xlabel('Good Sleep?')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('Cluster 3')

sns.distplot(np.array(final_sleep_labels[cluster_assignments==3], dtype=np.int16), ax = ax[1, 1], kde=False)
ax[1, 1].set_xlabel('Good Sleep?')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('Cluster 4')
Out[72]:
Text(0.5, 1.0, 'Cluster 4')

Activity Histograms for Clusters

Cluster: 1

In [64]:
# Histograms of each activity-level percentage for all records in this cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==0), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==0), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==0), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==0), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[64]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [65]:
# Histograms of each activity-level percentage, split by good vs. poor sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==0) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==0) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[65]:
<matplotlib.legend.Legend at 0x211ad304828>

Cluster: 2

In [66]:
# Histograms of each activity-level percentage for all records in this cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==1), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==1), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==1), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==1), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[66]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [67]:
# Histograms of each activity-level percentage, split by good vs. poor sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==1) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==1) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[67]:
<matplotlib.legend.Legend at 0x211adb49518>

Cluster: 3

In [68]:
# Histograms of each activity-level percentage for all records in this cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==2), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==2), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==2), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==2), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[68]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [69]:
# Histograms of each activity-level percentage, split by good vs. poor sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==2) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==2) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[69]:
<matplotlib.legend.Legend at 0x211ae0e5c18>

Cluster: 4

In [70]:
# Histograms of each activity-level percentage for all records in this cluster
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==3), 0], ax = ax[0, 0])
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==3), 1], ax = ax[0, 1])
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==3), 2], ax = ax[1, 0])
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')

sns.distplot(activity_percentages[(cluster_assignments==3), 3], ax = ax[1, 1])
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
Out[70]:
Text(0.5, 1.0, '% Vigorous Activity Histogram')
In [71]:
# Histograms of each activity-level percentage, split by good vs. poor sleep label
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 0], ax = ax[0, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 0], ax = ax[0, 0], color='green', label='Good Sleep')
ax[0, 0].set_xlabel('% Sedentary Activity')
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('% Sedentary Activity Histogram')
ax[0, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 1], ax = ax[0, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 1], ax = ax[0, 1], color='green', label='Good Sleep')
ax[0, 1].set_xlabel('% Light Activity')
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('% Light Activity Histogram')
ax[0, 1].legend()

sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 2], ax = ax[1, 0], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 2], ax = ax[1, 0], color='green', label='Good Sleep')
ax[1, 0].set_xlabel('% Moderate Activity')
ax[1, 0].set_ylabel('Frequency')
ax[1, 0].set_title('% Moderate Activity Histogram')
ax[1, 0].legend()

sns.distplot(activity_percentages[(cluster_assignments==3) & (~final_sleep_labels), 3], ax = ax[1, 1], color='red', label='Poor Sleep')
sns.distplot(activity_percentages[(cluster_assignments==3) & (final_sleep_labels), 3], ax = ax[1, 1], color='green', label='Good Sleep')
ax[1, 1].set_xlabel('% Vigorous Activity')
ax[1, 1].set_ylabel('Frequency')
ax[1, 1].set_title('% Vigorous Activity Histogram')
ax[1, 1].legend()
Out[71]:
<matplotlib.legend.Legend at 0x211afce99e8>
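
The per-cluster histogram cells above repeat the same four panels with only the cluster index changed. One possible way to reduce the repetition is a small helper that loops over the activity levels. This sketch uses plain matplotlib `hist` rather than `sns.distplot` (which is deprecated in recent seaborn releases); `plot_activity_histograms` and the random demo arrays are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for a standalone run; drop this line inside the notebook
import matplotlib.pyplot as plt
import numpy as np

ACTIVITY_LEVELS = ["Sedentary", "Light", "Moderate", "Vigorous"]

def plot_activity_histograms(activity, good_sleep, mask):
    """Draw one panel per activity level for the records selected by `mask`."""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    # axes.ravel() visits the panels in the order [0,0], [0,1], [1,0], [1,1]
    for col, (ax, level) in enumerate(zip(axes.ravel(), ACTIVITY_LEVELS)):
        ax.hist(activity[mask & ~good_sleep, col], bins=15, color="red", alpha=0.5, label="Poor Sleep")
        ax.hist(activity[mask & good_sleep, col], bins=15, color="green", alpha=0.5, label="Good Sleep")
        ax.set_xlabel(f"% {level} Activity")
        ax.set_ylabel("Frequency")
        ax.set_title(f"% {level} Activity Histogram")
        ax.legend()
    return fig

# Random stand-ins for activity_percentages, final_sleep_labels and cluster_assignments
rng = np.random.default_rng(0)
demo_activity = rng.dirichlet([8, 2, 1, 1], size=60) * 100
demo_good_sleep = rng.random(60) > 0.5
demo_clusters = rng.integers(0, 4, size=60)

fig = plot_activity_histograms(demo_activity, demo_good_sleep, demo_clusters == 3)
```

Calling the helper once per cluster index, for example in a `for` loop over `np.unique(cluster_assignments)`, would reproduce the good-sleep versus poor-sleep figures for every cluster.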

Sub-Clustering on Activity Data

In [81]:
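# Note: the sub-cluster count is hard-coded to 8 here, while num_activity_clusters was reduced to 2 above,
# so the purity loop below reports only sub-clusters 0 and 1
sub_clusters = activity_percentage_clusterer(KL_Kmeans(num_clusters=8), cluster_assignments, activity_percentages)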
kmeans: X (58, 4)  centres (8, 4)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8D6E38D0>>
kmeans: 2 iterations  cluster sizes: [ 9 12 11 12  3  3  3  5]
kmeans: X (69, 4)  centres (8, 4)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8D6E38D0>>
kmeans: 9 iterations  cluster sizes: [ 7 18  4  2  8  4 15 11]
kmeans: X (72, 4)  centres (8, 4)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8D6E38D0>>
kmeans: 5 iterations  cluster sizes: [ 5  7  8 13 14  9 15  1]
kmeans: X (73, 4)  centres (8, 4)  delta=0.001  maxiter=100  metric=<bound method KL_Kmeans.k_l_distance of <kmeans_dm.KL_Kmeans object at 0x0000023F8D6E38D0>>
kmeans: 2 iterations  cluster sizes: [10  4  3  9  4 14 19 10]
In [82]:
# Sanity Check for the number of points in each cluster
print(np.unique(cluster_assignments, return_counts=True))
for sub_cluster in sub_clusters:
    print(sub_cluster.shape)
(array([0, 1, 2, 3], dtype=int64), array([58, 69, 72, 73], dtype=int64))
(58,)
(69,)
(72,)
(73,)

Cluster Purity in each subcluster

In [83]:
# Cluster purity: the fraction of records in a (sub-)cluster that carry
# the dominant sleep label of that (sub-)cluster
for index, sub_cluster in enumerate(sub_clusters):
    print('Master Cluster:', index+1)
    cluster_sleep_labels = final_sleep_labels[(cluster_assignments == index)]
    for sub_cluster_assignment in range(num_activity_clusters):
        sub_cluster_sleep_labels = cluster_sleep_labels[(sub_cluster==sub_cluster_assignment)]
        try:
            pos_sleep_label_purity = sum(sub_cluster_sleep_labels) / sub_cluster_sleep_labels.shape[0]
            print(f'Sub Cluster Number: {sub_cluster_assignment}, Purity:', max(pos_sleep_label_purity, 1 - pos_sleep_label_purity))
            print(f'Sub Cluster Number: {sub_cluster_assignment}, Good Sleep %:', pos_sleep_label_purity)
        except ZeroDivisionError:  # raised when no points fall in this sub-cluster
            print(f'Sub Cluster Number: {sub_cluster_assignment}, No Points assigned')
Master Cluster: 1
Sub Cluster Number: 0, Purity: 0.7777777777777778
Sub Cluster Number: 0, Good Sleep %: 0.2222222222222222
Sub Cluster Number: 1, Purity: 0.8333333333333334
Sub Cluster Number: 1, Good Sleep %: 0.16666666666666666
Master Cluster: 2
Sub Cluster Number: 0, Purity: 0.5714285714285714
Sub Cluster Number: 0, Good Sleep %: 0.42857142857142855
Sub Cluster Number: 1, Purity: 0.7777777777777778
Sub Cluster Number: 1, Good Sleep %: 0.7777777777777778
Master Cluster: 3
Sub Cluster Number: 0, Purity: 0.8
Sub Cluster Number: 0, Good Sleep %: 0.8
Sub Cluster Number: 1, Purity: 0.7142857142857143
Sub Cluster Number: 1, Good Sleep %: 0.7142857142857143
Master Cluster: 4
Sub Cluster Number: 0, Purity: 0.7
Sub Cluster Number: 0, Good Sleep %: 0.3
Sub Cluster Number: 1, Purity: 0.75
Sub Cluster Number: 1, Good Sleep %: 0.75
In [84]:
sleep_recipes = get_good_sleep_recipes(cluster_assignments, sub_clusters, activity_percentages, final_sleep_labels, good_sleep_ratio=1.)
sleep_recipes
Cluster: 1, Sub Cluster: 1, Good Ratio: 3.5
Cluster: 1, Sub Cluster: 2, Good Ratio: 3.0
Cluster: 1, Sub Cluster: 3, Good Ratio: inf
Cluster: 1, Sub Cluster: 4, Good Ratio: 1.0
Cluster: 1, Sub Cluster: 6, Good Ratio: 6.5
Cluster: 2, Sub Cluster: 0, Good Ratio: 4.0
Cluster: 2, Sub Cluster: 1, Good Ratio: 2.5
Cluster: 2, Sub Cluster: 2, Good Ratio: 1.6666666666666667
Cluster: 2, Sub Cluster: 3, Good Ratio: 5.5
Cluster: 2, Sub Cluster: 4, Good Ratio: 3.6666666666666665
Cluster: 2, Sub Cluster: 5, Good Ratio: 3.5
Cluster: 2, Sub Cluster: 6, Good Ratio: 4.0
Cluster: 3, Sub Cluster: 1, Good Ratio: 3.0
Cluster: 3, Sub Cluster: 4, Good Ratio: 1.0
Cluster: 3, Sub Cluster: 6, Good Ratio: 1.1111111111111112
Out[84]:
array([[84.4   , 15.625 ,  0.    ,  0.    ],
       [72.7   , 27.31  ,  0.    ,  0.    ],
       [76.94  , 22.1   ,  0.9897,  0.    ],
       [86.75  , 11.61  ,  1.302 ,  0.3384],
       [76.6   , 20.78  ,  1.21  ,  1.411 ],
       [74.    , 24.97  ,  1.016 ,  0.    ],
       [63.25  , 34.66  ,  0.896 ,  1.1455],
       [92.1   ,  7.918 ,  0.    ,  0.    ],
       [82.25  , 14.67  ,  1.506 ,  1.601 ],
       [86.7   , 13.305 ,  0.    ,  0.    ],
       [77.3   , 21.16  ,  0.952 ,  0.61  ],
       [78.3   , 21.67  ,  0.    ,  0.    ],
       [69.3   , 29.69  ,  1.007 ,  0.    ],
       [59.7   , 18.53  ,  9.22  , 12.555 ],
       [78.4   , 21.6   ,  0.    ,  0.    ]], dtype=float16)
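
The source of `get_good_sleep_recipes` is not shown in this notebook. Judging only from the printed output (a good ratio per sub-cluster, including `inf`, and one activity vector per selected sub-cluster), one plausible reading is sketched below. Everything in it is an assumption: the function name `good_sleep_recipes_sketch`, the good-to-poor count ratio, the `>=` threshold, and the use of the mean activity profile of the good-sleep members as the recipe.

```python
import numpy as np

def good_sleep_recipes_sketch(assignments, sub_assignments, activity, good_sleep, min_ratio=1.0):
    """Return one mean activity profile per sub-cluster whose good/poor ratio is at least min_ratio."""
    recipes = []
    for cluster_id, sub_labels in enumerate(sub_assignments):
        in_cluster = assignments == cluster_id
        cluster_activity = activity[in_cluster]
        cluster_good = good_sleep[in_cluster]
        for sub_id in np.unique(sub_labels):
            members = sub_labels == sub_id
            n_good = int(cluster_good[members].sum())
            n_poor = int(members.sum()) - n_good
            ratio = np.inf if n_poor == 0 else n_good / n_poor
            if ratio >= min_ratio:  # with min_ratio > 0 this guarantees at least one good record
                recipes.append(cluster_activity[members & cluster_good].mean(axis=0))
    return np.array(recipes)

# Toy data: two master clusters of four records each, four activity levels per record
rng = np.random.default_rng(0)
toy_activity = rng.dirichlet([8, 2, 1, 1], size=8) * 100
toy_assignments = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_sub_assignments = [np.array([0, 0, 1, 1]), np.array([0, 0, 0, 1])]
toy_good_sleep = np.array([True, True, False, False, True, False, False, True])

recipes = good_sleep_recipes_sketch(toy_assignments, toy_sub_assignments, toy_activity, toy_good_sleep)
```

In the toy data only two sub-clusters pass the threshold (both contain good-sleep records only), so `recipes` has two rows. Sub-clusters with very few records can produce extreme ratios, which may be worth keeping in mind when interpreting entries such as `inf` in the output above.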
In [85]:
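# Bar chart of each recipe over the Sedentary, Light, Moderate and Vigorous levels.
# Dividing by 1440 assumes the recipe is in minutes per day; if the rows already
# sum to about 100 (as in Out[84]), plot sleep_recipe directly instead.
for i, sleep_recipe in enumerate(sleep_recipes):
    plt.figure(i)
    plt.bar(['S', 'L', 'M', 'V'], (sleep_recipe / 1440 * 100))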